## put your code here
library(knitr)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(broom)
This project is based on real complany data to analyze the relationships of Virtual Machine(VM) remote session usage with date time and geolocation of VM and remote clients. The company name remains anonymous in the project. Two data files: Remote session available on GitHub here and Region available on GitHub here.
The company sells and leases many its storage products to customers globally. The audit agent installed on customer’s network allows secure transfer of the error and diagnostic information to be sent back to the Customer Support Center(remote session). Engineers then analyze these error reports and trouble shoot for customers.
The company’s system has been tracking when, where(with IP address), how much and who data transfer comes from. However there is no data analytic tool to analyze these information. The project intends to use R to do statistical computing, depict the remote session allocations graphically, identity capacity shortages, over allocated VMs, hotspot with different geo locations and date time.
With recent three month of remote session data(over one million rows), the goal is to find out whether there is a trend between the number of allocated VM sessions, session duration, data transfer has correlation with geolocation and date time.
Check whether normal distribution is followed and whether linear model can be applied. Here are some key benefits:
The backbone data comes from the company’s database, which tracks on the remote session’s start/end time, installed agent’s IP, the VM name located in different data center, how much has been transferred and the session termination cause. The data will be queried via SQL and extracted into CSV fommat(session.csv). Java program is used to call IP lookup web servoce and extra into CSV format(region.csv)
First load the sessions from CSV file, which is extracted from production DB.
There are over 1 million sessios in the past three months. Each row is correspond to each remote session initiated globally.
Here is a decription of the columns:
Session_ID: Unique identifier of each remote session
Login: Account name to initiate remote session and access remote VM
Start_Time: The timestamp when remote session starts
End_Time: The timestamp when remote session ends
Agent_IP: IP address of the agent installed on product
Termination_Cause: a lookup value of differnt typese of terminate. Normally it is 3.
Data_Transfer: the file transfer from the agent(client) to the VM(Server) in byte Gas_Server_Name: One of the fifteen VMs loccated in five data centers(Cork, Hopkinton, Durham, Santa Clara and Singapore). Each data center has three VMs. Three data centers are located in the US: Hopkinto, Durham and Santa Clara. One of them in Europe: Cork and one in Asia: Singapore.
Duration: The time span between Start_Time and End_Time in minutes
WeekofYear: The week number since January 1, 2016. For example, January 8, 2016 will be 2.
Dateofyear: The date number since January 1, 2016.For example, January 8, 2016 will be 8.
Weekday: Day of the week. For example: SUN, MON.
## put your code here
session <- read_csv(gzfile("~/Downloads/session.csv.gz"))
login_info<-session %>% group_by(LOGIN) %>%
summarize(login_count=n())
head(login_info[order(-login_info$login_count),], n=1 )
## Source: local data frame [1 x 2]
##
## LOGIN login_count
## (chr) (int)
## 1 syr 304708
The highest number of remote sessions was initated by a generic account: SYR.
Do data manipulation to separate the start time to year, month, day, hour, minute and second.
## put your code here
p<-session %>% filter(START_TIME!="null")
p$SDATE <- as.Date(p$START_TIME,"%m/%d/%Y")
p<-p %>% separate(SDATE, into=c("SYEAR", "SMON", "SDATE"),sep="-")
p$STIME<-format(as.POSIXct(strptime(p$START_TIME, "%m/%d/%Y %H:%M:%S")), format="%H:%M:%S")
p<-p%>% separate(STIME, into=c("SHOUR", "SMIN", "SSECOND"),sep=":")
Each data center has three VMs named from 01 to 03. Add a DC column mapped from GAS_Server_Name column
## put your code here
DC <-c(esrgckprd01="Cork",esrgckprd02="Cork",esrgckprd03="Cork",esrghoprd01="Hopkinton", esrghoprd02="Hopkinton", esrghoprd03="Hopkinton", esrgscprd01="Santa Clara", esrgscprd02="Santa Clara", esrgscprd03="Santa Clara", esrgweprd01="Durham", esrgweprd02="Durham",esrgweprd03="Durham", esrgspprd01="Singapore", esrgspprd02="Singapore",esrgspprd03="Singapore")
p$DC <-DC[p$GAS_SERVER_NAME]
Next the trend of between the number of remote sessions and the dayofyear from 1/31/2016 to 4/30/2016. Use point size to denote the average data transfer size in MB.
## put your code here
p %>% group_by(DAYOFYEAR) %>%
summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
ggplot(aes(DAYOFYEAR, login_count, group=1)) +
geom_point(aes( size=DATA_TRANSFER)) + geom_smooth(color="red")+
xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote sessions on all VMs")
As a whole, the trend is gradually up when time goes by in the past three months. Interestingly, there is a steeper trend that more remote sessions are loaded over the weekend from 1/31/2016 to 4/30/2016. It appears that more and more people work over the weekend.
What is the trend during 24 hour cycle?
## put your code here
p %>% group_by(SHOUR) %>%
summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
ggplot(aes(SHOUR, login_count, group=1)) +
geom_point(aes( size=DATA_TRANSFER)) + geom_smooth(color="red")+
xlab("24 hour in EST")+ylab("The number of remote sessions") +labs(title="Remote sessions around the hour")
It is concluded that during 24 hour cycle, more number of remote sessions are initiated in the morning of EST. More data transfer happens between 8 PM towards midnight EST.
Across the five data centers, which one has the most loading in number of remote sessions?
## put your code here
p %>% group_by(DC) %>%
summarize(login_count=n()) %>%
#ggplot(aes(DC, login_count)) +
#geom_point()
ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+
xlab("Data Centers")+ylab("The number of remote sessions") +labs(title="Remote sessions across Data Center")
It turns out the data center in Cork takes the most remote sessions. Then what about the average duration considering normal termination(termination cause ==3) and average data transferred?
## put your code here
p %>% filter(!is.na(DURATION) & (TERMINATION_CAUSE==3)) %>% group_by(DC) %>%
summarize(DURATION=mean(as.numeric(DURATION))) %>%
ggplot(aes(DC, DURATION, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(DURATION)))+
xlab("Data Centers")+ylab("Average remote session durtaion in minutes") +labs(title="Average remote session durtaion across Data Center")
p %>% filter(!is.na(DATA_TRANSFER)) %>% group_by(DC) %>%
summarize(DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
ggplot(aes(DC, DATA_TRANSFER, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(DATA_TRANSFER)))+
xlab("Data Centers")+ylab("Average data transfer in MBs") +labs(title="Average data transfer across Data Center")
It appears the durations are evenly distributed acorss the data center while Singapore data center receives the least data transfer.
Let’s concentrate on the number of remote session across VMs in all data centers.
p %>% group_by(GAS_SERVER_NAME) %>%
summarize(login_count=n()) %>%
ggplot(aes(GAS_SERVER_NAME, login_count, fill=GAS_SERVER_NAME)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+
xlab("VMs across Data Centers")+ylab("The number of remote sessions") +labs(title="Remote sessions across VMs")
p %>% group_by(DAYOFYEAR,DC) %>%
summarize(login_count=n()) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count, color = DC)) + geom_line()+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="The trend of remote sessions for data centers")
The result is consistent with the previous observation: Cork VMs are taking the most loading of remote sessions.
Now consider the geolocation of the client, which is based on the installed agent’s IP address.
First load region.csv, which is generated separately by a Java program calling IP lookup web serivce for each IP address.
Then left join session to region so that each row contains both client and server information.
Look into the data based on the client country.
#region<-read_csv("/Users/chenr1/Documents/data_science/FinalProject/2016FinalProject/region.csv")
region <- read_csv(gzfile("~/Downloads/region.csv.gz"))
p<-left_join(p, region, by=c("AGENT_IP"="IP"))
login_info<-p %>% group_by(Country) %>%
summarize(login_count=n())
head(login_info[order(-login_info$login_count),], n=10 )
## Source: local data frame [10 x 2]
##
## Country login_count
## (chr) (int)
## 1 US 585486
## 2 DE 63767
## 3 GB 44149
## 4 FR 39580
## 5 BR 34879
## 6 JP 32798
## 7 IT 30022
## 8 ES 29664
## 9 AU 25195
## 10 SE 19644
login_info<-p %>% filter(Country=="US")%>% group_by(Region) %>%
summarize(login_count=n())
head(login_info[order(-login_info$login_count),], n=5 )
## Source: local data frame [5 x 2]
##
## Region login_count
## (chr) (int)
## 1 NY 71204
## 2 Unknown 70584
## 3 CA 56619
## 4 TX 42856
## 5 NJ 40497
Client remote sessions coming from the US are way ahead of other countries. Within the US, NY is the leader. Unknown region, coming next, means the IP address is internal or can not be determined.
p %>% group_by(Country) %>%
summarize(login_count=n()) %>% filter(login_count>10000) %>%
mutate(Country=reorder(Country,login_count)) %>%
ggplot(aes(Country, login_count, fill=Country)) +geom_bar(stat="identity")+xlab("Countries with more than 10,000 remote sessions from 1/31/216 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote session distribution")
Based on the graph, the number of remote sessions coming from US is way ahead of other countries.
Based on the previous obvervation, the load on Cork data center was hit most. It appears a number of US remote sessions going to Cork Data Center even though there are three US data centers, which is inefficient.
p %>% filter(DC=="Cork") %>%group_by(Country) %>%
summarize(login_count=n()) %>% filter(login_count>10000) %>%
mutate(Country=reorder(Country,login_count)) %>%
ggplot(aes(Country, login_count, fill=Country)) +geom_bar(stat="identity")+xlab("Countries with more than 10,000 remote sessions from 1/31/216 to 4/30/2016")+ylab("The number of remote sessions") +labs(title="Remote session distribution in Cork data center")
The graph above confirms: the number of remote sessions coming from the US ranks second among european countries for Cork data center.
Take another angle, of all the remote ressions coming from US, what about their distribution of the data centers?
p %>% filter(Country=="US") %>%group_by(DC) %>%
summarize(login_count=n()) %>%
ggplot(aes(DC, login_count, fill=DC)) +geom_bar(stat="identity")+
geom_hline(aes( yintercept = mean(login_count)))+xlab("Data Centers")+ylab("The number of remote sessions") +labs(title="Data center distribution of the remote sessions coming from the US")
Obviously the US data center in Santa Clara is not fully used while a number of US remote sessions are sent to Cork data center. Connection delay is expected. Based on the business knowlege, the agent uses its configured timezone to decide which data center to access. It appears some the configuration are wrong.
Split the dta into groups by data center considering the number of remote sessions, duration and data transfer.
theme_set(theme_bw())
DataCenter<-c("Cork", "Hopkinton", "Durham","Santa Clara","Singapore")
p %>%
group_by(DAYOFYEAR, DC) %>%
summarize(login_count=n(),DATA_TRANSFER=mean(DATA_TRANSFER/1000000)) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count, size=DATA_TRANSFER)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Average number of remote sessions") +labs(title="Remote session trend across Data Centers")
#Duration scross Data center
p %>% filter(TERMINATION_CAUSE==3) %>%
group_by(DAYOFYEAR, DC) %>%
summarize(DURATION=mean(DURATION)) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = DURATION)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Duration of remote session in minutes") +labs(title="Duration of remote session on each Data Center")
#Data Transfer scross Data Center
p %>%
group_by(DAYOFYEAR, DC) %>%
summarize(DATA_TRANSFER=mean(DATA_TRANSFER)/1000000) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = DATA_TRANSFER)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Data Transfer in MBs ") +labs(title="Data transfer of remote session on each Data Center")
#Check the remote login coming from US across Data ceter
p %>% filter(Country=="US") %>%
group_by(DAYOFYEAR, DC) %>%
summarize(login_count=n()) %>%
filter(DC %in% DataCenter) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~DC, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Number of remote sessions") +labs(title="Remote login coming from US across Data center")
#Each data center has three VMs for the remote sessions
VM <-c("esrgckprd01","esrgckprd02","esrgckprd03","esrghoprd01","esrghoprd02","esrghoprd03","esrgweprd01","esrgweprd02","esrgweprd03","esrgscprd01","esrgscprd02","esrgscprd03","esrgspprd01","esrgspprd02","esrgspprd03")
# check remote sessions cross each VM
p %>% group_by(DAYOFYEAR, GAS_SERVER_NAME) %>%
summarize(login_count=n()) %>%
filter(GAS_SERVER_NAME %in% VM) %>%
ggplot(aes(x = DAYOFYEAR, y = login_count)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~GAS_SERVER_NAME, scales = "free")+xlab("Day of Year from 1/31/2016 to 4/30/2016")+ylab("Average number of remote sessions") +labs(title="Cork:ckxxx,Hop:hoxxx,Durham:wexxx,Santa Clara:scxxx,Singapore:spxxx")
The number of remote sessions increases consistently in the US data centers(Durham, Hopkinton and Santa Clara) and Singapore with data/time. The data center in Santa Clara and Sigapore are not fully used.
Duration of remote sessions in each data center reduces with data/time.
The remote sessions coming from the US is routed to Santa Clara in a faster pace than any other data center.
Check whether the number of remote session, average Data Transfer and session duration is in normal distribution on any one of the DC. Pick Durham for example.
## put your code here
x<-p %>%
group_by(DAYOFYEAR) %>%
summarize(login_count=n())
fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .)
hist(x$login_count)
x %>%
ggplot(aes(DAYOFYEAR, login_count)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Average number of remote sessions")
sd(x$login_count)
## [1] 4235.624
qqnorm(x$login_count)
qqline(x$login_count)
The Normal Q-Q plot shows the bulk number of remote sessions with day of year does not fit a good normal distribultion(big tails!).
From the previous observation, the remote session count over the weekend is way below that of weekday. Also the remote session count from the US is way higher than other contries. Then let’s see whether the number is in normal distribution if we filter out these two variables.
x<-p %>%
filter(Country=="US"&(WEEKDAY!="SAT" & WEEKDAY !='SUN')) %>%
group_by(DAYOFYEAR) %>%
summarize(login_count=n())
fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .)
hist(x$login_count)
x %>%
ggplot(aes(DAYOFYEAR, login_count)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Average number of remote sessions from US")
sd(x$login_count)
## [1] 1078.45
qqnorm(x$login_count)
qqline(x$login_count)
#mean(x$login_count) + c(-1,1)*qnorm(0.975)*sd(x$login_count)
x<-p %>%
filter(Country!="US"&(WEEKDAY!="SAT" & WEEKDAY !='SUN')) %>%
group_by(DAYOFYEAR) %>%
summarize(login_count=n())
fit<-x %>% lm(login_count ~ DAYOFYEAR, data = .)
hist(x$login_count)
x %>%
ggplot(aes(DAYOFYEAR, login_count)) +
geom_point() +
geom_abline(intercept = fit$coef[1],
slope = fit$coef[2])+xlab("Day of Year from 1/31/2016 to 4/30/2016 except weekends")+ylab("Average number of remote sessions not from US")
sd(x$login_count)
## [1] 1346.771
qqnorm(x$login_count)
qqline(x$login_count)
The regression line shows the number of remote session from the US increases but the number of from non-US countries decreases. Q-Q plots confirms that the linear models fits the data well.
The bulk data is not in perfect normal distribution. However if the remote session s filtered by the source(Country) and remove weekend data(because the remote session count is much lower than weekdays), the filtered data is in normal distribution and confirmed by Mormal Q-Q plot.
Santa Clara and Singapore data center’s VM are relatively lightly used by the remote sessions while Cork’s data center is loaded heavily. Santa Clara and Singapore data centers should be used effectively.
A number of agents, because their timezone are not configured correctly, are routing to the data centers far away from their geolication. For example, a number of remote sesions coming from the US is routed to Cork, even to Singapore, even though the data center in Santa Clara has enough capacity.
The general trend of the number of remote sessions rise gradually with dates. The number of remote sessions rises dramatically over the weened although it is much lower than weekdays. The duration of remote session to each data center reduces consistently.
The number of remote sessions coming from the US is way ahead of that of other countries. While the number of remote session from the US increases, the number of from non-US countries decreases.